This case-study is a subset of the data of the 6th study of the Clinical Proteomic Technology Assessment for Cancer (CPTAC). In this experiment, the authors spiked the Sigma Universal Protein Standard mixture 1 (UPS1) containing 48 different human proteins in a protein background of 60 ng/\(\mu\)L Saccharomyces cerevisiae strain BY4741. Two different spike-in concentrations were used: 6A (0.25 fmol UPS1 proteins/\(\mu\)L) and 6B (0.74 fmol UPS1 proteins/\(\mu\)L) [5]. We limited ourselves to the data of LTQ-Orbitrap W at site 56. The data were searched with MaxQuant version 1.5.2.8, and detailed search settings were described in Goeminne et al. (2016) [1]. Three replicates are available for each concentration.
We first import the data from peptideRaws.txt file. This is the file containing
your peptideRaw-level intensities. For a MaxQuant search [6],
this peptideRaws.txt file can be found by default in the
“path_to_raw_files/combined/txt/” folder from the MaxQuant output,
with “path_to_raw_files” the folder where the raw files were saved.
In this vignette, we use a MaxQuant peptideRaws file which is a subset
of the cptac study. This data is available in the msdata package.
To import the data we use the QFeatures package.
We generate the object peptideRawFile with the path to the peptideRaws.txt file.
Using the grepEcols function, we find the columns that contain the expression
data of the peptideRaws in the peptideRaws.txt file.
library(tidyverse)
library(limma)
library(QFeatures)
library(msqrob2)
library(plotly)
peptidesFile <- "https://raw.githubusercontent.com/statOmics/SGA2020/data/quantification/cptacAvsB_lab3/peptides.txt"
ecols <- MSnbase::grepEcols(
peptidesFile,
"Intensity ",
split = "\t")
pe <- readQFeatures(
table = peptidesFile,
fnames = 1,
ecol = ecols,
name = "peptideRaw", sep="\t")
In the following code chunk, we can extract the spikein condition from the raw file name.
cond <- which(
strsplit(colnames(pe)[[1]][1], split = "")[[1]] == "A") # find where condition is stored
colData(pe)$condition <- substr(colnames(pe), cond, cond) %>%
unlist %>%
as.factor
We calculate how many non zero intensities we have per peptide and this will be useful for filtering.
rowData(pe[["peptideRaw"]])$nNonZero <- rowSums(assay(pe[["peptideRaw"]]) > 0)
Peptides with zero intensities are missing peptides and should be represent
with a NA value rather than 0.
pe <- zeroIsNA(pe, "peptideRaw") # convert 0 to NA
We can inspect the missingness in our data with the plotNA() function
provided with MSnbase.
45% of all peptide
intensities are missing and for some peptides we do not even measure a signal
in any sample. The missingness is similar across samples.
MSnbase::plotNA(assay(pe[["peptideRaw"]])) +
xlab("Peptide index (ordered by data completeness)")
This section preforms standard preprocessing for the peptide data. This include log transformation, filtering and summarisation of the data.
pe <- logTransform(pe, base = 2, i = "peptideRaw", name = "peptideLog")
limma::plotDensities(assay(pe[["peptideLog"]]))
In our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup.
pe[["peptideLog"]] <-
pe[["peptideLog"]][rowData(pe[["peptideLog"]])$Proteins
%in% smallestUniqueGroups(rowData(pe[["peptideLog"]])$Proteins),]
We now remove the contaminants, peptides that map to decoy sequences, and proteins which were only identified by peptides with modifications.
pe[["peptideLog"]] <- pe[["peptideLog"]][rowData(pe[["peptideLog"]])$Reverse != "+", ]
pe[["peptideLog"]] <- pe[["peptideLog"]][rowData(pe[["peptideLog"]])$
Potential.contaminant != "+", ]
I will skip this step for the moment. Large protein groups file needed for this.
We keep peptides that were observed at last twice.
pe[["peptideLog"]] <- pe[["peptideLog"]][rowData(pe[["peptideLog"]])$nNonZero >= 2, ]
nrow(pe[["peptideLog"]])
## [1] 7011
We keep 7011 peptides after filtering.
pe <- normalize(pe, i = "peptideLog", method = "quantiles", name = "peptideNorm")
After quantile normalisation the density curves for all samples coincide.
limma::plotDensities(assay(pe[["peptideNorm"]]))
This is more clearly seen is a boxplot.
boxplot(assay(pe[["peptideNorm"]]), col = palette()[-1],
main = "Peptide distribtutions after normalisation", ylab = "intensity")
We can visualize our data using a Multi Dimensional Scaling plot,
eg. as provided by the limma package.
limma::plotMDS(assay(pe[["peptideNorm"]]), col = as.numeric(colData(pe)$condition))
The first axis in the plot is showing the leading log fold changes (differences on the log scale) between the samples. We notice that the leading differences (log FC) in the peptideRaw data seems to be driven by technical variability. Indeed, the samples do not seem to be clearly separated according to the spike-in condition.
We use robust summarization in aggregateFeatures. This is the default workflow of aggregateFeatures so you do not have to specifiy the argument fun.
However, because we compare methods we have included the fun argument to show the summarization method explicitely.
pe <- aggregateFeatures(pe,
i = "peptideNorm",
fcol = "Proteins",
na.rm = TRUE,
name = "proteinRobust",
fun = MsCoreUtils::robustSummary)
## Your quantitative and row data contain missing values. Please read the
## relevant section(s) in the aggregateFeatures manual page regarding the
## effects of missing values on data aggregation.
We notice that the leading differences (log FC) in the protein data are still according to technical variation.
plotMDS(assay(pe[["proteinRobust"]]), col = as.numeric(colData(pe)$condition))
We model the protein level expression values using msqrob.
By default msqrob2 estimates the model parameters using robust regression.
pe <- msqrob(object = pe, i = "proteinRobust", formula = ~condition)
First, we extract the parameter names of the model.
getCoef(rowData(pe[["proteinRobust"]])$msqrobModels[[1]])
## (Intercept) conditionB
## 15.060948 1.564747
Spike-in condition a is the reference class. So the mean log2 expression for samples from condition a is ‘(Intercept). The mean log2 expression for samples from condition B is’(Intercept)+conditionB’. Hence, the average log2 fold change between condition b and condition a is modelled using the parameter ‘conditionB’. Thus, we assess the contrast ‘conditionB=0’ with our statistical test.
L <- makeContrast("conditionB=0", parameterNames = c("conditionB"))
pe <- hypothesisTest(object = pe, i = "proteinRobust", contrast = L)
volcano <- ggplot(rowData(pe[["proteinRobust"]])$conditionB,
aes(x = logFC, y = -log10(pval), color = adjPval < 0.05)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) + theme_minimal()
volcano
We first select the names of the proteins that were declared signficant.
sigNames <- rowData(pe[["proteinRobust"]])$conditionB %>%
rownames_to_column("proteinRobust") %>%
filter(adjPval<0.05) %>%
pull(proteinRobust)
heatmap(assay(pe[["proteinRobust"]])[sigNames, ])
We make boxplot of the log2 FC and stratify according to the whether a protein is spiked or not.
rowData(pe[["proteinRobust"]])$conditionB %>%
rownames_to_column(var = "protein") %>%
ggplot(aes(x=grepl("UPS",protein),y=logFC)) +
geom_boxplot() +
xlab("UPS") +
geom_segment(
x = 1.5,
xend = 2.5,
y = log2(0.74/0.25),
yend = log2(0.74/0.25),
colour="red") +
geom_segment(
x = 0.5,
xend = 1.5,
y = 0,
yend = 0,
colour="red") +
annotate(
"text",
x = c(1,2),
y = c(0,log2(0.74/0.25))+.1,
label = c(
"log2 FC Ecoli = 0",
paste0("log2 FC UPS = ",round(log2(0.74/0.25),2))
),
colour = "red")
## Warning: Removed 167 rows containing non-finite values (stat_boxplot).
What do you observe?
We first extract the normalized peptideRaw expression values for a particular protein.
for (protName in sigNames)
{
pePlot <- pe[protName, , c("peptideNorm","proteinRobust")]
pePlotDf <- data.frame(longFormat(pePlot))
pePlotDf$assay <- factor(pePlotDf$assay,
levels = c("peptideNorm", "proteinRobust"))
pePlotDf$condition <- as.factor(colData(pePlot)[pePlotDf$colname, "condition"])
# plotting
p1 <- ggplot(data = pePlotDf,
aes(x = colname, y = value, group = rowname)) +
geom_line() + geom_point() + theme_minimal() +
facet_grid(~assay) + ggtitle(protName)
print(p1)
# plotting 2
p2 <- ggplot(pePlotDf, aes(x = colname, y = value, fill = condition)) +
geom_boxplot(outlier.shape = NA) + geom_point(position = position_jitter(width = .1),
aes(shape = rowname)) +
scale_shape_manual(values = 1:nrow(pePlotDf)) +
labs(title = protName, x = "sample", y = "peptide intensity (log2)") + theme_minimal()
facet_grid(~assay)
print(p2)
}